Introduction

Daily fantasy sports (DFS) are a subset of fantasy sports in which participants construct lineups based on the games occurring on a given day [https://en.wikipedia.org/wiki/Daily_fantasy_sports]. Lineups are subject to various constraints, such as a salary cap and a minimum number of players at each position. As with other fantasy sports, players' fantasy points are based on their actual performance in real life. As a result, a key component of being a successful DFS participant is the ability to project each player's total points for a given night.

This study focuses on NBA DFS, with a particular emphasis on examining factors that influence a player's score. The NBA data used in the study is sourced from Erik Berg's API [https://erikberg.com/api]. The study covers NBA data from the 2012-2013 season through the 2015-2016 season. Fantasy points are calculated using FanDuel's scoring system (source below); the exact weights appear in the fantasy-points formula in the Feature Engineering section. Unlike other sports, all positions in the NBA are scored using the same system.

Source: https://www.fanduel.com/rules

Data Import & Cleaning

Two CSV files were created using the NBA data API. The player_data file contains a record for each player who recorded any statistics during a game. The event_data file contains a record for each game, identified by a unique gameID. This gameID can be used to join the two data sources.

Before producing any plots or analysis, the data is imported and cleaned.

##import packages
library(data.table)
library(dplyr)
library(dtplyr)
library(tidyr)
library(ggplot2)
source("dlin.R")
library(RColorBrewer)
library(reshape2)
library(corrplot)
source("roll_variable.R")
library(RcppRoll)
library(rms)
source("multiplot.R")
library(plyr)

palette = brewer.pal("YlGnBu", n=9)

##read in data
player_data = read.csv("data/player_data.csv")
event_data = read.csv("data/event_data.csv")

##convert to data.tables
event_data = as.data.table(event_data)
player_data = as.data.table(player_data)

#rename and drop unnecessary columns
player_data[ ,c("X","sport") := NULL]
event_data[,X:=NULL]
setnames(player_data,old = c('X3FGA','X3FGM'),new = c('3FGA','3FGM'))

#get rid of duplicate rows in event_data
setkey(event_data,gameID)
event_data = unique(event_data)

#convert positions coded as 'F' to 'SF', 'G' to 'SG'
player_data[position == 'F',position:= 'SF']
player_data[position == 'G',position:= 'SG']

Feature Engineering

Creating powerful predictions relies not only on a good model, but also on informative features that represent the underlying effects.

As the original data only contains base-level statistics, the first step involves creating a variable for the FanDuel fantasy points. Other data cleaning and merging steps are documented below.

##calculate fanduel points
player_data[,fd:= 3*`3FGM`+2*(FGM-`3FGM`)+1*FTM+1.2*rebounds+1.5*assists+2*blocks+2*steals-1*turnovers]

##create a season variable that can be used to distinguish between different seasons
player_data[,date:=as.Date(date)] ##convert string date to actual date
player_data[,date_num:=as.numeric(date)]
player_data[,season_code:=20122013]
player_data[date >= '2013-10-28' & date <= '2014-06-16', season_code:= 20132014]
player_data[date >= '2014-10-28' & date <= '2015-06-16', season_code:= 20142015]
player_data[date >= '2015-10-27' & date <= '2016-06-19', season_code:= 20152016]

event_data = event_data %>% join(player_data[,.N,by=.(gameID,date)][,.(gameID,date)])

event_data[,season_code:=20122013]
event_data[date >= '2013-10-28' & date <= '2014-06-16', season_code:=20132014]
event_data[date >= '2014-10-28' & date <= '2015-06-16', season_code:=20142015]
event_data[date >= '2015-10-27' & date <= '2016-06-19', season_code:=20152016]
##filter out players who didn't play
player_data = filter(player_data,minutes > 0)

##get team info
team_data=event_data[,.(gameID,home_team,away_team)]
setkey(player_data,gameID)
setkey(team_data,gameID)
player_data = player_data[team_data,nomatch = 0] ##merge the team data to the player_data table

##create home/away variable
player_data[,homeaway:= 1]
player_data[team==away_team,homeaway:= 0]
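The scoring formula used to compute fd above can be checked by hand on a single stat line. The sketch below wraps the same weights in a small helper (the function name fanduel_points and its argument names are hypothetical, introduced only for illustration) and applies it to the first row shown in the glimpse output later in this report (Chris Bosh, 2012-10-30), whose recorded fd value is 37.5.

```r
# Hypothetical helper that mirrors the weights in the fd formula above.
fanduel_points <- function(fgm3, fgm, ftm, rebounds, assists,
                           blocks, steals, turnovers) {
  3 * fgm3 +            # three-point field goals made
    2 * (fgm - fgm3) +  # two-point field goals made
    1 * ftm +           # free throws made
    1.2 * rebounds +
    1.5 * assists +
    2 * blocks +
    2 * steals -
    1 * turnovers
}

# Stat line taken from the first row of the glimpse output in this report.
bosh_fd <- fanduel_points(fgm3 = 0, fgm = 8, ftm = 3, rebounds = 10,
                          assists = 1, blocks = 3, steals = 0, turnovers = 1)
print(bosh_fd)
```

Working through the terms gives 16 + 3 + 12 + 1.5 + 6 - 1, which agrees with the fd value stored for that row.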

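Next, a function is created to calculate rolling averages for any statistic. The function takes in a data frame, a target field for the rolling averages, and a window variable in the form seq(starting_window,max_window,increment). Each rolling average is lagged by one row, so the value attached to a game only uses earlier games. Where a longer window cannot be computed yet, the value from the next shorter window is used instead.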

roll_variable_mean = function(d, target, windows) {
  require(dplyr)
  require(lazyeval)
  require(RcppRoll) ##provides roll_mean
  
  ##build one lagged, right-aligned rolling mean expression per window size
  exprl = list()
  i = 1
  
  for (x in windows) {
    exprl[[i]] = interp(~ lag(roll_mean(tar, w, align = 'right', fill = NA), 1), tar=as.name(target), w=x)
    i = i + 1
  }
  names = paste(target, windows, sep="_")
  exprl = setNames(exprl, names)
  
  d = mutate_(d, .dots = exprl)
  
  ##where a longer window is still NA, fall back to the next shorter window
  ##seq_along(names)[-1] is empty when only one window is supplied
  for (n in seq_along(names)[-1]) {
    expr = interp(~ ifelse(is.na(long), short, long), long=as.name(names[n]), short=as.name(names[n-1]))
    exprl = setNames(list(expr), names[n])
    d = mutate_(d, .dots = exprl)
  }
  
  return(d)
}
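To make the behaviour of the rolling function concrete, here is a minimal base-R sketch of the same two ideas: a right-aligned rolling mean shifted by one game, and the fallback from a longer window to a shorter one. It does not use RcppRoll or dplyr; the helper lagged_roll_mean and the toy vector are assumptions made purely for illustration.

```r
# Hypothetical helper: right-aligned rolling mean of width w, lagged by one row.
lagged_roll_mean <- function(x, w) {
  # sides = 1 averages the current value and the w - 1 values before it
  rolled <- as.numeric(stats::filter(x, rep(1 / w, w), sides = 1))
  # shift down one position so row t only sees rows before t
  c(NA, head(rolled, -1))
}

fd <- c(10, 20, 30, 40, 50)  # toy fantasy points for five consecutive games

fd_2 <- lagged_roll_mean(fd, 2)
fd_4 <- lagged_roll_mean(fd, 4)

# Fallback step: use the shorter window wherever the longer one is still NA.
fd_4_filled <- ifelse(is.na(fd_4), fd_2, fd_4)

print(data.frame(fd, fd_2, fd_4, fd_4_filled))
```

In this toy series the four-game average only becomes available at the fifth game, so the third and fourth games borrow the two-game average, while the first two games stay NA because no window can be filled yet.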



##add rolling variables
window_size = seq(5,55,10)
player_data = player_data %>% group_by(player) %>% arrange(date) %>% roll_variable_mean(., 'fd', window_size)
player_data = player_data %>% group_by(player) %>% arrange(date) %>% roll_variable_mean(., 'minutes', window_size)
player_data = player_data %>% group_by(player) %>% arrange(date) %>% roll_variable_mean(., 'FGA', window_size)
player_data = player_data %>% group_by(player) %>% arrange(date) %>% roll_variable_mean(., 'FTA', window_size)


window_size = seq(1,3,1)
player_data = player_data %>% group_by(player) %>% arrange(date) %>% roll_variable_mean(., 'fd', window_size)
player_data = player_data %>% group_by(player) %>% arrange(date) %>% roll_variable_mean(., 'minutes', window_size)
player_data = player_data %>% group_by(player) %>% arrange(date) %>% roll_variable_mean(., 'FGA', window_size)
player_data = player_data %>% group_by(player) %>% arrange(date) %>% roll_variable_mean(., 'FTA', window_size)

A practical hypothesis is that players and teams playing games on consecutive days will perform worse as a result of fatigue. To test this hypothesis, a binary back-to-back feature is created which signals whether the team played the previous day. A variable is also created to signal whether the team's opponent is playing a back-to-back game.

##BACK TO BACK GAME VARIABLE
days_rest = player_data[,.N,by = .(team,date_num)][,.(team,date_num)][order(team,date_num)]
days_rest$date_diff = ave(days_rest$date_num, days_rest$team, FUN = function(x) c(10, diff(x)))
days_rest$b2b = ifelse(days_rest$date_diff == 1,1,0)

##add opponent field
player_data[,opponent:= home_team]
player_data[team == home_team,opponent:= away_team]

##join b2b,opp_b2b
setkey(player_data,team,date_num)
setkey(days_rest,team,date_num)
player_data = player_data[days_rest[,.(team,date_num,b2b)],nomatch = 0]

days_rest[,opp_b2b:= b2b][,b2b:= NULL]
setkey(player_data,opponent,date_num)
player_data = player_data[days_rest[,.(team,date_num,opp_b2b)],nomatch = 0]
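The back-to-back flag relies on differencing each team's sorted game dates. The sketch below repeats that step on a small made-up schedule (the data frame and its values are hypothetical), using the same ave/diff pattern and the same placeholder gap of 10 days for each team's first game.

```r
# Toy schedule: numeric dates are days since 1970-01-01, as in date_num.
schedule <- data.frame(
  team = c("BOS", "BOS", "BOS", "MIA", "MIA"),
  date_num = c(15643, 15644, 15647, 15643, 15645),
  stringsAsFactors = FALSE
)
schedule <- schedule[order(schedule$team, schedule$date_num), ]

# Days since the team's previous game; the first game gets a placeholder of 10.
schedule$date_diff <- ave(schedule$date_num, schedule$team,
                          FUN = function(x) c(10, diff(x)))

# A gap of exactly one day marks the second leg of a back-to-back.
schedule$b2b <- ifelse(schedule$date_diff == 1, 1, 0)
print(schedule)
```

Only the second Boston game is flagged here, since it is the only game played one day after the same team's previous game.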

In order to calculate team based statistics, the event_data table must be manipulated to have one record per team per game.

###TEAM BASED DATA
##currently the event data table has one record per game, with statistics for each team
colnames(event_data)
##  [1] "attendance"     "away_3FGA"      "away_3FGM"      "away_FGA"      
##  [5] "away_FGM"       "away_FTA"       "away_FTM"       "away_Q1"       
##  [9] "away_Q2"        "away_Q3"        "away_Q4"        "away_Q5"       
## [13] "away_Q6"        "away_Q7"        "away_Q8"        "away_assists"  
## [17] "away_blocks"    "away_fouls"     "away_points"    "away_rebounds" 
## [21] "away_steals"    "away_team"      "away_turnovers" "duration"      
## [25] "gameID"         "home_3FGA"      "home_3FGM"      "home_FGA"      
## [29] "home_FGM"       "home_FTA"       "home_FTM"       "home_Q1"       
## [33] "home_Q2"        "home_Q3"        "home_Q4"        "home_Q5"       
## [37] "home_Q6"        "home_Q7"        "home_Q8"        "home_assists"  
## [41] "home_blocks"    "home_fouls"     "home_points"    "home_rebounds" 
## [45] "home_steals"    "home_team"      "home_turnovers" "official_1"    
## [49] "official_2"     "official_3"     "official_4"     "season_type"   
## [53] "date"           "season_code"

As shown in the column names of event_data, team variables are preceded by ‘home_’ or ‘away_’. After splitting the table into one record per team per game, these variable names must be standardized by removing the ‘home_’/‘away_’ prefix.

##calculate final scores for each game
event_data[,home_score:= sum(home_Q1,home_Q2,home_Q3,home_Q4,home_Q5,home_Q6,home_Q7,home_Q8,na.rm= TRUE),
           by = 1:NROW(event_data)]
event_data[,away_score:=sum(away_Q1,away_Q2,away_Q3,away_Q4,away_Q5,away_Q6,away_Q7,away_Q8,na.rm= TRUE),
           by = 1:NROW(event_data)]

##to calculate team based statistics, a table with two records per game (one per team) is ideal
##the code below splits the event_data table into two tables, one for each team,
##standardizes the variable names, and then stacks the two tables back together
team_variables = c('3FGA','3FGM','FGM','FGA','FTA','FTM',
                  'Q1','Q2','Q3','Q4','Q5','Q6','Q7',
                  'Q8','assists','blocks','fouls',
                  'points','rebounds','steals','turnovers')

away_variables = paste0('away_',team_variables)
home_variables = paste0('home_',team_variables)

##data table for the home team of each game
event_data_1 = event_data[,team:= home_team][,setdiff(colnames(event_data),away_variables),with = FALSE]
##data table for the away team of each game
event_data_2 = event_data[,team:= away_team][,setdiff(colnames(event_data),home_variables),with = FALSE]

##change the column names to generic names (i.e. away_FGM --> FGM,home_FTA --> FTA)
setnames(event_data_1,old = home_variables,new = team_variables)
setnames(event_data_2,old = away_variables,new = team_variables)

##join the two tables back together
team_data = rbind(event_data_1,event_data_2)
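The split-rename-stack pattern above can be illustrated on a two-game toy table in base R. Everything in this sketch (the games data frame, the one_side helper, and the values) is hypothetical and only meant to show the shape of the transformation from one row per game to one row per team per game.

```r
# Toy wide table: one row per game, with home_/away_ prefixed statistics.
games <- data.frame(
  gameID = c("g1", "g2"),
  home_team = c("MIA", "UTA"),
  away_team = c("BOS", "DAL"),
  home_FGM = c(40, 35),
  away_FGM = c(38, 37),
  stringsAsFactors = FALSE
)
stats <- c("FGM")
id_cols <- c("gameID", "home_team", "away_team")

# Hypothetical helper: keep one side's statistics and strip the prefix.
one_side <- function(side) {
  cols <- paste0(side, "_", stats)
  out <- games[, c(id_cols, cols)]
  names(out)[match(cols, names(out))] <- stats  # e.g. home_FGM becomes FGM
  out$team <- games[[paste0(side, "_team")]]
  out
}

# Stack the home rows and the away rows: two records per game.
team_rows <- rbind(one_side("home"), one_side("away"))
print(team_rows)
```

Renaming by exact column match rather than by pattern avoids accidentally renaming home_team and away_team, which share the same prefixes as the statistics.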

Using the new team_data table, team based features can be calculated. The total FD points for each team and their opponent are calculated below. These features are then merged back to the player_data table.

##add date to team_data
setkey(team_data,gameID)
setkey(player_data,gameID)
team_data= team_data[player_data[,.N,by = .(gameID,date_num)][,.(gameID,date_num)],nomatch = 0][order(date_num)]

##define opponent in team_data table
team_data[,opponent:= home_team]
team_data[team == home_team,opponent:= away_team]

###team total FD points
team_tot_fd = player_data[,.(team_fd = sum(fd)),by = .(gameID,team)]
setkey(team_tot_fd,gameID,team)
setkey(player_data,gameID,team)
setkey(team_data,gameID,team)
player_data = player_data[team_tot_fd,nomatch = 0]
team_data = team_data[team_tot_fd,nomatch = 0]
  
###opponent total FD points
team_tot_fd[,`:=` (opponent = team, team = NULL,opp_fd = team_fd,team_fd = NULL)]
setkey(team_tot_fd,gameID,opponent)
setkey(player_data,gameID,opponent)
setkey(team_data,gameID,opponent)
player_data = player_data[team_tot_fd,nomatch = 0]
team_data = team_data[team_tot_fd,nomatch = 0]
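As a rough illustration of the two joins above, the following base-R sketch computes team totals once and then attaches them twice: first matched on the player's own team, then relabelled and matched on the player's opponent. The toy data frame and its values are assumptions for illustration; the original code performs the same steps with keyed data.table joins.

```r
# Toy player rows for a single game (values are made up).
players <- data.frame(
  gameID = "g1",
  team = c("MIA", "MIA", "BOS", "BOS"),
  opponent = c("BOS", "BOS", "MIA", "MIA"),
  fd = c(37.5, 30.7, 20.0, 15.5),
  stringsAsFactors = FALSE
)

# Total FD points per team per game.
team_tot <- aggregate(fd ~ gameID + team, data = players, FUN = sum)
names(team_tot)[names(team_tot) == "fd"] <- "team_fd"

# Same totals, relabelled so they can be matched on the opponent column.
opp_tot <- setNames(team_tot, c("gameID", "opponent", "opp_fd"))

players <- merge(players, team_tot, by = c("gameID", "team"))
players <- merge(players, opp_tot, by = c("gameID", "opponent"))
print(players)
```

Each player row ends up carrying both its own team's total and the opposing team's total for that game.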

Another potentially powerful feature is the number of FD points a team gives up to certain positions. For example, if a team has weak defense against centers/power forwards, it should be reflected in how many FD points they give up to the opposing teams' centers/PFs (on average).

The code below aggregates the FD points by game, team, and position. These results are then joined back to team_data, and computed for the opponents. Rolling averages are then calculated, and finally merged back to player_data.

The final result in player_data will look like:

player       position  team  opponent  opp_fd_5  opp_p_fd_5  opp_g_fd_5
Chris Bosh   PF        MIA   BOS       3.221     5.662       6.123

where the last three fields are the average FD points given up by Boston to opposing teams, C/PFs, and PG/SG/SFs respectively.

##position points summary
##calculate the total FD points for each team by position
##in some cases, teams played w/o a center. for those, use the post statistic (p_fd)
posn_sum = player_data[,.(posn_points = sum(fd)),by = .(gameID,team,position)]
posn_sum = posn_sum[,.(pg_fd = sum(ifelse(position == 'PG',posn_points,0)),
                     sg_fd = sum(ifelse(position == 'SG',posn_points,0)),
                     sf_fd = sum(ifelse(position == 'SF',posn_points,0)),
                     pf_fd = sum(ifelse(position == 'PF',posn_points,0)),
                     c_fd = sum(ifelse(position == 'C',posn_points,0)),
                     g_fd = sum(ifelse(position %in% c('PG','SG','SF'),posn_points,0)),
                     p_fd = sum(ifelse(position %in% c('PF','C'),posn_points,0))),
                     by = .(gameID,team)]


##merge positional points with team_data
setkey(posn_sum,gameID,team)
setkey(team_data,gameID,team)
team_data=team_data[posn_sum,nomatch=0]

##get team opponent data
team_data_opp = team_data[,.(gameID,team,pg_fd,sg_fd,sf_fd,pf_fd,c_fd,g_fd,p_fd,team_fd)]
setnames(team_data_opp,old = c("team","pg_fd","sg_fd","sf_fd","pf_fd","c_fd","g_fd","p_fd","team_fd"),
         new = c("opponent","opp_pg_fd","opp_sg_fd","opp_sf_fd","opp_pf_fd","opp_c_fd","opp_g_fd",
                 "opp_p_fd","opp_fd"))
setkey(team_data,gameID,opponent)
setkey(team_data_opp,gameID,opponent)
team_data = team_data[team_data_opp,nomatch = 0]

##rolling team statistics
##the following stats describe how many fantasy points a team has been giving up to opposing teams,
##expressed as rolling averages over the last X games
team_data=team_data %>% group_by(team) %>% arrange(date_num) %>% roll_variable_mean(., 'opp_fd', window_size)
team_data=team_data %>% group_by(team) %>% arrange(date_num) %>% roll_variable_mean(., 'opp_g_fd', window_size)
team_data=team_data %>% group_by(team) %>% arrange(date_num) %>% roll_variable_mean(., 'opp_p_fd', window_size)

##join these features to player_data
##join team_data "team" on player_data "opponent"
rolling_team_variables = colnames(team_data)[grepl('fd_\\w*\\d', colnames(team_data))]
player_data = merge(player_data, team_data[,append(rolling_team_variables,c("gameID","team")),with = FALSE], 
                  by.x = c('gameID','opponent'), by.y = c('gameID','team'), all = FALSE)
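The positional grouping can be sketched on toy data as follows. The grouping of PG/SG/SF into g_fd and PF/C into p_fd follows the code above; the data frame, its values, and the use of xtabs are assumptions chosen to keep the example short.

```r
# Toy player rows for one game (values are made up).
players <- data.frame(
  team = c("MIA", "MIA", "MIA", "BOS", "BOS"),
  position = c("PF", "PG", "C", "SG", "C"),
  fd = c(30, 20, 10, 25, 15),
  stringsAsFactors = FALSE
)

# Posts (PF, C) go to p_fd; guards and small forwards go to g_fd.
players$group <- ifelse(players$position %in% c("PF", "C"), "p_fd", "g_fd")

# FD points scored by each team, split by positional group.
posn <- xtabs(fd ~ team + group, data = players)
print(posn)

# Points Boston gave up to opposing posts are simply Miami's post points.
bos_allowed_p <- posn["MIA", "p_fd"]
print(bos_allowed_p)
```

Reading the table across teams in this way is what turns a team's own positional scoring into its opponent's "points allowed" feature.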

Univariate Analysis

What is the structure of your dataset?

What is/are the main feature(s) of interest in your dataset?

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Did you create any new variables from existing variables in the dataset?

--> could look at the distribution of home/away as a sanity check

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

glimpse(player_data)
## Observations: 103,233
## Variables: 76
## $ gameID      <fctr> 20121030-boston-celtics-at-miami-heat, 20121030-b...
## $ opponent    <fctr> BOS, BOS, BOS, BOS, BOS, BOS, BOS, BOS, BOS, BOS,...
## $ 3FGA        <int> 1, 1, 0, 3, 4, 3, 2, 0, 0, 2, 2, 4, 0, 0, 1, 3, 0,...
## $ 3FGM        <int> 0, 0, 0, 2, 2, 2, 1, 0, 0, 1, 0, 2, 0, 0, 1, 0, 0,...
## $ FGA         <int> 15, 7, 22, 4, 16, 7, 5, 1, 0, 2, 14, 15, 8, 11, 6,...
## $ FGM         <int> 8, 3, 10, 2, 10, 5, 4, 0, 0, 1, 9, 6, 4, 6, 5, 2, ...
## $ FTA         <int> 4, 2, 11, 0, 5, 8, 2, 0, 0, 0, 4, 9, 1, 4, 0, 4, 4...
## $ FTM         <int> 3, 2, 9, 0, 4, 7, 1, 0, 0, 0, 2, 9, 1, 3, 0, 4, 3,...
## $ assists     <int> 1, 11, 4, 1, 3, 2, 1, 0, 1, 1, 13, 5, 2, 1, 1, 1, ...
## $ blocks      <int> 3, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0,...
## $ date        <date> 2012-10-30, 2012-10-30, 2012-10-30, 2012-10-30, 2...
## $ fouls       <int> 3, 3, 3, 3, 2, 1, 1, 1, 1, 2, 4, 3, 4, 2, 5, 3, 0,...
## $ minutes     <int> 37, 36, 35, 29, 29, 31, 19, 11, 7, 6, 44, 41, 32, ...
## $ player      <fctr> Chris Bosh, Mario Chalmers, Dwyane Wade, Shane Ba...
## $ points      <int> 19, 8, 29, 6, 26, 19, 10, 0, 0, 3, 20, 23, 9, 15, ...
## $ position    <fctr> PF, PG, SG, SF, SF, SG, PF, PF, SF, PG, PG, SF, P...
## $ rebounds    <int> 10, 1, 3, 2, 10, 2, 5, 3, 0, 0, 7, 5, 12, 11, 1, 0...
## $ starter     <int> 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0,...
## $ steals      <int> 0, 3, 2, 1, 2, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 1, 1,...
## $ team        <fctr> MIA, MIA, MIA, MIA, MIA, MIA, MIA, MIA, MIA, MIA,...
## $ turnovers   <int> 1, 1, 4, 0, 0, 0, 0, 1, 0, 1, 4, 0, 5, 1, 1, 3, 1,...
## $ fd          <dbl> 37.5, 30.7, 40.6, 11.9, 46.5, 24.4, 19.5, 2.6, 1.5...
## $ date_num    <dbl> 15643, 15643, 15643, 15643, 15643, 15643, 15643, 1...
## $ season_code <dbl> 20122013, 20122013, 20122013, 20122013, 20122013, ...
## $ home_team   <fctr> MIA, MIA, MIA, MIA, MIA, MIA, MIA, MIA, MIA, MIA,...
## $ away_team   <fctr> BOS, BOS, BOS, BOS, BOS, BOS, BOS, BOS, BOS, BOS,...
## $ homeaway    <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0,...
## $ fd_5        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fd_15       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fd_25       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fd_35       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fd_45       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fd_55       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ minutes_5   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ minutes_15  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ minutes_25  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ minutes_35  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ minutes_45  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ minutes_55  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ FGA_5       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ FGA_15      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ FGA_25      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ FGA_35      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ FGA_45      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ FGA_55      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ FTA_5       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ FTA_15      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ FTA_25      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ FTA_35      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ FTA_45      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ FTA_55      <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fd_1        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fd_2        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ fd_3        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ minutes_1   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ minutes_2   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ minutes_3   <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ FGA_1       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ FGA_2       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ FGA_3       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ FTA_1       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ FTA_2       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ FTA_3       <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ b2b         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ opp_b2b     <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ team_fd     <dbl> 218.7, 218.7, 218.7, 218.7, 218.7, 218.7, 218.7, 2...
## $ opp_fd      <dbl> 189.2, 189.2, 189.2, 189.2, 189.2, 189.2, 189.2, 1...
## $ opp_fd_1    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ opp_fd_2    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ opp_fd_3    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ opp_g_fd_1  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ opp_g_fd_2  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ opp_g_fd_3  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ opp_p_fd_1  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ opp_p_fd_2  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ opp_p_fd_3  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...

A quick glimpse at the player data table reveals that most variables are either integers or doubles. Most of the text-based variables, such as team or player, are factors. All of the rolling-window variables appear to be NA in the rows shown; however, this is due to the ordering of the data. For example, there are not enough prior games at the beginning of the data to compute a 15-game average.

When exploring the impacts on a player’s FD points, the first question that comes to mind is about the distribution of FD points.

##                                                 gameID      
##  20151206-los-angeles-lakers-at-detroit-pistons    :    52  
##  20151206-golden-state-warriors-at-brooklyn-nets   :    48  
##  20151206-dallas-mavericks-at-washington-wizards   :    40  
##  20151206-sacramento-kings-at-oklahoma-city-thunder:    40  
##  20151206-phoenix-suns-at-memphis-grizzlies        :    38  
##  20121031-dallas-mavericks-at-utah-jazz            :    26  
##  (Other)                                           :102989  
##     opponent          3FGA             3FGM              FGA        
##  LAL    : 3499   Min.   : 0.000   Min.   : 0.0000   Min.   : 0.000  
##  PHO    : 3487   1st Qu.: 0.000   1st Qu.: 0.0000   1st Qu.: 4.000  
##  BKN    : 3480   Median : 1.000   Median : 0.0000   Median : 7.000  
##  MIN    : 3480   Mean   : 2.097   Mean   : 0.7454   Mean   : 7.941  
##  SAC    : 3480   3rd Qu.: 3.000   3rd Qu.: 1.0000   3rd Qu.:11.000  
##  SA     : 3475   Max.   :22.000   Max.   :12.0000   Max.   :50.000  
##  (Other):82332                                                      
##       FGM             FTA              FTM            assists      
##  Min.   : 0.00   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 1.00   1st Qu.: 0.000   1st Qu.: 0.000   1st Qu.: 0.000  
##  Median : 3.00   Median : 2.000   Median : 1.000   Median : 1.000  
##  Mean   : 3.59   Mean   : 2.192   Mean   : 1.653   Mean   : 2.108  
##  3rd Qu.: 5.00   3rd Qu.: 4.000   3rd Qu.: 2.000   3rd Qu.: 3.000  
##  Max.   :24.00   Max.   :39.000   Max.   :25.000   Max.   :21.000  
##                                                                    
##      blocks             date                fouls          minutes     
##  Min.   : 0.0000   Min.   :2012-10-30   Min.   :0.000   Min.   : 1.00  
##  1st Qu.: 0.0000   1st Qu.:2013-10-30   1st Qu.:1.000   1st Qu.:15.00  
##  Median : 0.0000   Median :2014-10-29   Median :2.000   Median :24.00  
##  Mean   : 0.4669   Mean   :2014-07-24   Mean   :1.931   Mean   :23.07  
##  3rd Qu.: 1.0000   3rd Qu.:2015-10-28   3rd Qu.:3.000   3rd Qu.:32.00  
##  Max.   :12.0000   Max.   :2016-04-13   Max.   :6.000   Max.   :60.00  
##                                                                        
##               player           points       position      rebounds     
##  Tristan Thompson:   327   Min.   : 0.000   C :16572   Min.   : 0.000  
##  Evan Turner     :   326   1st Qu.: 4.000   F :    0   1st Qu.: 1.000  
##  Corey Brewer    :   325   Median : 8.000   G :    0   Median : 3.000  
##  Monta Ellis     :   325   Mean   : 9.578   PF:22839   Mean   : 4.099  
##  DeAndre Jordan  :   323   3rd Qu.:14.000   PG:22280   3rd Qu.: 6.000  
##  Jeff Green      :   322   Max.   :62.000   SF:19891   Max.   :29.000  
##  (Other)         :101285                    SG:21651                   
##     starter           steals            team         turnovers     
##  Min.   :0.0000   Min.   :0.0000   SA     : 3805   Min.   : 0.000  
##  1st Qu.:0.0000   1st Qu.:0.0000   DAL    : 3647   1st Qu.: 0.000  
##  Median :0.0000   Median :0.0000   GS     : 3594   Median : 1.000  
##  Mean   :0.4767   Mean   :0.7406   BKN    : 3568   Mean   : 1.323  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   OKC    : 3535   3rd Qu.: 2.000  
##  Max.   :1.0000   Max.   :9.0000   LAC    : 3520   Max.   :11.000  
##                                    (Other):81564                   
##        fd           date_num      season_code         home_team    
##  Min.   :-4.00   Min.   :15643   Min.   :20122013   SA     : 3663  
##  1st Qu.: 8.60   1st Qu.:16008   1st Qu.:20132014   OKC    : 3559  
##  Median :17.10   Median :16372   Median :20142015   BKN    : 3536  
##  Mean   :18.75   Mean   :16275   Mean   :20137079   DAL    : 3533  
##  3rd Qu.:26.90   3rd Qu.:16736   3rd Qu.:20152016   GS     : 3520  
##  Max.   :89.00   Max.   :16904   Max.   :20152016   LAC    : 3487  
##                                                     (Other):81935  
##    away_team        homeaway           fd_5           fd_15      
##  SA     : 3617   Min.   :0.0000   Min.   :-0.20   Min.   :-0.20  
##  DAL    : 3549   1st Qu.:0.0000   1st Qu.:10.82   1st Qu.:11.17  
##  GS     : 3527   Median :1.0000   Median :17.64   Median :17.57  
##  BKN    : 3512   Mean   :0.5002   Mean   :18.93   Mean   :18.89  
##  UTA    : 3503   3rd Qu.:1.0000   3rd Qu.:25.68   3rd Qu.:25.21  
##  PHO    : 3502   Max.   :1.0000   Max.   :72.28   Max.   :60.02  
##  (Other):82023                    NA's   :3539    NA's   :3539   
##      fd_25           fd_35           fd_45           fd_55      
##  Min.   :-0.20   Min.   :-0.20   Min.   :-0.20   Min.   :-0.20  
##  1st Qu.:11.28   1st Qu.:11.36   1st Qu.:11.42   1st Qu.:11.49  
##  Median :17.51   Median :17.47   Median :17.42   Median :17.38  
##  Mean   :18.87   Mean   :18.85   Mean   :18.84   Mean   :18.83  
##  3rd Qu.:25.06   3rd Qu.:24.96   3rd Qu.:24.92   3rd Qu.:24.89  
##  Max.   :56.56   Max.   :55.81   Max.   :54.90   Max.   :53.89  
##  NA's   :3539    NA's   :3539    NA's   :3539    NA's   :3539   
##    minutes_5       minutes_15      minutes_25      minutes_35   
##  Min.   : 1.00   Min.   : 1.00   Min.   : 1.00   Min.   : 1.00  
##  1st Qu.:16.20   1st Qu.:16.33   1st Qu.:16.40   1st Qu.:16.47  
##  Median :23.80   Median :23.73   Median :23.72   Median :23.66  
##  Mean   :23.27   Mean   :23.25   Mean   :23.25   Mean   :23.24  
##  3rd Qu.:31.00   3rd Qu.:30.73   3rd Qu.:30.60   3rd Qu.:30.49  
##  Max.   :45.80   Max.   :43.00   Max.   :42.40   Max.   :42.40  
##  NA's   :3539    NA's   :3539    NA's   :3539    NA's   :3539   
##    minutes_45      minutes_55        FGA_5            FGA_15      
##  Min.   : 1.00   Min.   : 1.00   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:16.52   1st Qu.:16.60   1st Qu.: 4.400   1st Qu.: 4.533  
##  Median :23.62   Median :23.60   Median : 7.200   Median : 7.267  
##  Mean   :23.24   Mean   :23.24   Mean   : 8.012   Mean   : 7.999  
##  3rd Qu.:30.42   3rd Qu.:30.36   3rd Qu.:11.000   3rd Qu.:11.000  
##  Max.   :42.40   Max.   :42.40   Max.   :31.200   Max.   :25.667  
##  NA's   :3539    NA's   :3539    NA's   :3539     NA's   :3539    
##      FGA_25           FGA_35           FGA_45           FGA_55      
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.560   1st Qu.: 4.600   1st Qu.: 4.600   1st Qu.: 4.636  
##  Median : 7.240   Median : 7.229   Median : 7.244   Median : 7.218  
##  Mean   : 7.991   Mean   : 7.984   Mean   : 7.980   Mean   : 7.979  
##  3rd Qu.:10.920   3rd Qu.:10.857   3rd Qu.:10.844   3rd Qu.:10.800  
##  Max.   :24.960   Max.   :24.057   Max.   :23.600   Max.   :23.600  
##  NA's   :3539     NA's   :3539     NA's   :3539     NA's   :3539    
##      FTA_5            FTA_15           FTA_25           FTA_35      
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 0.800   1st Qu.: 0.933   1st Qu.: 0.960   1st Qu.: 0.960  
##  Median : 1.600   Median : 1.733   Median : 1.760   Median : 1.743  
##  Mean   : 2.214   Mean   : 2.214   Mean   : 2.213   Mean   : 2.210  
##  3rd Qu.: 3.200   3rd Qu.: 3.067   3rd Qu.: 3.000   3rd Qu.: 3.000  
##  Max.   :16.200   Max.   :12.933   Max.   :12.800   Max.   :12.800  
##  NA's   :3539     NA's   :3539     NA's   :3539     NA's   :3539    
##      FTA_45           FTA_55            b2b            opp_b2b      
##  Min.   : 0.000   Min.   : 0.000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.: 0.978   1st Qu.: 0.982   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median : 1.756   Median : 1.745   Median :0.0000   Median :0.0000  
##  Mean   : 2.208   Mean   : 2.207   Mean   :0.2266   Mean   :0.2282  
##  3rd Qu.: 3.000   3rd Qu.: 3.000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :12.800   Max.   :12.800   Max.   :1.0000   Max.   :1.0000  
##  NA's   :3539     NA's   :3539                                      
##     team_fd          opp_fd         opp_fd_5       opp_fd_15    
##  Min.   :121.2   Min.   :121.2   Min.   :152.2   Min.   :162.8  
##  1st Qu.:180.9   1st Qu.:180.8   1st Qu.:188.4   1st Qu.:189.9  
##  Median :196.1   Median :195.9   Median :196.5   Median :196.4  
##  Mean   :197.2   Mean   :197.0   Mean   :196.9   Mean   :196.7  
##  3rd Qu.:211.7   3rd Qu.:211.4   3rd Qu.:205.0   3rd Qu.:203.3  
##  Max.   :461.4   Max.   :461.4   Max.   :269.3   Max.   :240.5  
##                                  NA's   :1612    NA's   :1612   
##    opp_fd_25       opp_fd_35       opp_fd_45       opp_fd_55    
##  Min.   :162.8   Min.   :162.8   Min.   :162.8   Min.   :162.8  
##  1st Qu.:190.3   1st Qu.:190.3   1st Qu.:190.4   1st Qu.:190.5  
##  Median :196.5   Median :196.4   Median :196.2   Median :196.1  
##  Mean   :196.5   Mean   :196.4   Mean   :196.2   Mean   :196.1  
##  3rd Qu.:202.7   3rd Qu.:202.4   3rd Qu.:202.0   3rd Qu.:201.7  
##  Max.   :231.7   Max.   :232.0   Max.   :229.2   Max.   :227.4  
##  NA's   :1612    NA's   :1612    NA's   :1612    NA's   :1612   
##    opp_g_fd_5     opp_g_fd_15     opp_g_fd_25     opp_g_fd_35   
##  Min.   : 67.9   Min.   : 72.5   Min.   : 72.5   Min.   : 72.5  
##  1st Qu.:107.6   1st Qu.:108.7   1st Qu.:109.1   1st Qu.:109.2  
##  Median :118.4   Median :118.3   Median :118.4   Median :118.5  
##  Mean   :120.4   Mean   :120.3   Mean   :120.2   Mean   :120.1  
##  3rd Qu.:131.5   3rd Qu.:130.1   3rd Qu.:129.1   3rd Qu.:128.6  
##  Max.   :206.9   Max.   :186.0   Max.   :182.3   Max.   :178.9  
##  NA's   :1612    NA's   :1612    NA's   :1612    NA's   :1612   
##   opp_g_fd_45     opp_g_fd_55      opp_p_fd_5      opp_p_fd_15    
##  Min.   : 72.5   Min.   : 72.5   Min.   :  2.64   Min.   : 15.66  
##  1st Qu.:109.2   1st Qu.:109.2   1st Qu.: 66.14   1st Qu.: 67.41  
##  Median :118.8   Median :118.8   Median : 77.90   Median : 78.02  
##  Mean   :120.0   Mean   :120.0   Mean   : 76.37   Mean   : 76.32  
##  3rd Qu.:128.4   3rd Qu.:128.0   3rd Qu.: 86.88   3rd Qu.: 85.83  
##  Max.   :176.0   Max.   :176.0   Max.   :139.24   Max.   :131.42  
##  NA's   :1612    NA's   :1612    NA's   :1612     NA's   :1612    
##   opp_p_fd_25      opp_p_fd_35      opp_p_fd_45      opp_p_fd_55    
##  Min.   : 26.49   Min.   : 27.56   Min.   : 29.71   Min.   : 29.79  
##  1st Qu.: 67.68   1st Qu.: 67.91   1st Qu.: 68.14   1st Qu.: 68.08  
##  Median : 78.11   Median : 78.24   Median : 78.31   Median : 78.33  
##  Mean   : 76.28   Mean   : 76.23   Mean   : 76.17   Mean   : 76.10  
##  3rd Qu.: 85.46   3rd Qu.: 84.95   3rd Qu.: 84.73   3rd Qu.: 84.56  
##  Max.   :131.42   Max.   :131.42   Max.   :131.42   Max.   :131.42  
##  NA's   :1612     NA's   :1612     NA's   :1612     NA's   :1612

When exploring the impacts on a player’s FD points, the first question that comes to mind is about the distribution of FD points.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   -4.00    8.60   17.10   18.75   26.90   89.00

For the NBA, the FD points distribution is positively skewed. There are many occurrences where players score 0 FD points despite having played more than 0 minutes in the game. This happens when a player records no statistics during his time on the court. The mean and median FD points across all players are 18.8 and 17.1, respectively.
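As a rough numerical check on the skew, the summary statistics printed above can be plugged into a quartile-based (Bowley) skewness measure. This is only a sketch: the measure is not one used elsewhere in this report, and the variable names are chosen purely for illustration.

```r
# Summary statistics copied from the summary() output above.
q1  <- 8.60
med <- 17.10
q3  <- 26.90
avg <- 18.75

# Bowley (quartile) skewness: positive values indicate a longer right tail.
bowley <- (q3 + q1 - 2 * med) / (q3 - q1)

# The gap between mean and median points in the same direction.
mean_minus_median <- avg - med

print(c(bowley = bowley, mean_minus_median = mean_minus_median))
```

Both quantities are positive, which agrees with the positive skew described above. The quartile measure ignores the extreme tail (the maximum of 89), so it understates how far the largest scores sit from the bulk of the data.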

Another interesting observation is that a few players actually get negative FD points, due to a high number of turnovers and lack of other offensive production.

Next, we examine a breakdown of the different player positions.

## Source: local data table [5 x 2]
## 
## # tbl_dt [5 x 2]
##   position avg_per_game
##     <fctr>        <dbl>
## 1       PF     2.322925
## 2       PG     2.266070
## 3       SG     2.202095
## 4       SF     2.023088
## 5        C     1.685517

In general, there is an approximately even number of players across all positions, with slightly fewer centers. On average, 1.7 centers play per game per team, compared to 2.0-2.3 players at each of the other positions.

Next, we examine the length of each game. Games that run for extended periods allow players to accumulate more points through increased playing time, leading to higher fantasy production.

##    Game_Length percentage
## 1:     Quad_OT       0.02
## 2:   Triple_OT       0.19
## 3:   Double_OT       1.02
## 4:   Single_OT       5.81
## 5:     Regular      92.97

As shown, only about 7% of games go to overtime.

Rolling averages over different time windows were computed for several variables. These rolling variables should follow distributions very similar to those of the underlying, un-rolled variables. As a sanity check, the distributions of minutes and of average minutes over the past X games are shown below.

The distributions for the rolling variables look similar to the base minutes distribution. They differ in that the rolling minutes have fewer occurrences of low-minute games. Low-minute games (i.e. <5 min) are presumably due to cases where a player was injured mid-game, was brought in to close out a blowout, or was pulled due to poor play. Intuitively, these cases should not occur over consecutive games.
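The helper in roll_variable.R is not shown in this report, so the following is only a minimal sketch of how a rolling feature of this kind could be computed. It assumes the window covers the games before the current one, and that a shorter window is averaged early in a player’s history. The function name rolling_prev_mean and the toy minutes vector are hypothetical.

```r
# Mean of up to k previous observations, excluding the current one.
# The first observation has no history, so it is NA.
rolling_prev_mean <- function(x, k) {
  n <- length(x)
  out <- rep(NA_real_, n)
  for (i in seq_len(n)) {
    if (i > 1) {
      window <- max(1, i - k):(i - 1)
      out[i] <- mean(x[window])
    }
  }
  out
}

# Hypothetical minutes for one player, including a single 2-minute game.
minutes <- c(30, 34, 28, 2, 33, 31)
minutes_3 <- rolling_prev_mean(minutes, k = 3)
print(round(minutes_3, 2))
```

In this toy series the single 2-minute game lowers the rolling values that follow it, but no rolling value comes close to 2. That is consistent with the rolling distributions containing fewer low-minute values than the raw minutes.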


Bivariate Analysis

Through the univariate analysis, an understanding of the underlying distributions and structure of the dataset was achieved. Next, the relationships between various feature variables and a player’s FD points are explored.

A logical starting point is to look at how well a player’s previous FD points predict his points on a given night. The plot below examines one of the rolling features created, the mean of the FD points from the last 5 games.

There appears to be a strong trend between FD points and the average of previous FD points. In general, players who put up a lot of points continue to do so, with the opposite holding true as well.

While previous FD points is a strong feature by itself, it doesn’t account for any factors that may impact a player’s performance on a nightly basis. A powerful model would be able to explain deviations from a player’s mean. To explore one of these factors, a player’s playing time is plotted against his FD points.

The plot above shows that there is a very strong relationship between minutes played and FD points, R² ≈ 0.65. Minutes played correlates much more strongly with FD points than past FD points do. This is intuitive, as more time spent on the court allows players to produce more via shots, rebounds, assists, etc. It also does a better job of explaining variation from a player’s mean, as on some nights a player will play more or less as dictated by the flow of the game.

Some people would argue that a more informative variable is a player’s FD points per minute, a measure of efficiency.

##define player efficiency feature
player_data[, eff := fd/minutes]

While there appears to be some trend, the data is relatively noisy, due to the presence of some high efficiencies.

#examine some outliers
glimpse(player_data[eff>3,.(gameID,player,minutes,fd,`3FGM`,FGM,FTM,rebounds,assists,blocks,steals,turnovers)][1:10])
## Observations: 10
## Variables: 12
## $ gameID    <fctr> 20121104-phoenix-suns-at-orlando-magic, 20121116-ne...
## $ player    <fctr> Kyle O'Quinn, Chris Copeland, Jarvis Varnado, Quinc...
## $ minutes   <int> 1, 1, 1, 3, 2, 1, 2, 1, 2, 2
## $ fd        <dbl> 3.2, 3.7, 5.2, 9.2, 7.2, 3.5, 6.5, 3.2, 6.5, 7.6
## $ 3FGM      <int> 0, 0, 0, 0, 0, 0, 1, 0, 1, 0
## $ FGM       <int> 0, 0, 2, 1, 2, 0, 1, 1, 1, 2
## $ FTM       <int> 2, 1, 0, 4, 0, 0, 0, 0, 0, 0
## $ rebounds  <int> 1, 1, 1, 1, 1, 0, 0, 1, 0, 3
## $ assists   <int> 0, 1, 0, 0, 0, 1, 1, 0, 1, 0
## $ blocks    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
## $ steals    <int> 0, 0, 0, 1, 1, 1, 1, 0, 1, 0
## $ turnovers <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0

All of these outliers appear to be due to scenarios where players recorded multiple points, assists or rebounds in 3 minutes or less; clearly unsustainable production.
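To make the outliers concrete, the ten rows printed above can be re-scored by hand. The sketch below is not the report’s scoring code. The weights (1 per point, 1.2 per rebound, 1.5 per assist, 2 per steal) are an assumption inferred from these rows, and they reproduce the printed fd values. Blocks and turnovers are zero in all ten rows, so their weights cannot be checked here and are left out. The column name FG3M stands in for 3FGM.

```r
# Box-score values copied from the glimpse() output above.
outliers <- data.frame(
  minutes  = c(1, 1, 1, 3, 2, 1, 2, 1, 2, 2),
  fd       = c(3.2, 3.7, 5.2, 9.2, 7.2, 3.5, 6.5, 3.2, 6.5, 7.6),
  FG3M     = c(0, 0, 0, 0, 0, 0, 1, 0, 1, 0),
  FGM      = c(0, 0, 2, 1, 2, 0, 1, 1, 1, 2),
  FTM      = c(2, 1, 0, 4, 0, 0, 0, 0, 0, 0),
  rebounds = c(1, 1, 1, 1, 1, 0, 0, 1, 0, 3),
  assists  = c(0, 1, 0, 0, 0, 1, 1, 0, 1, 0),
  steals   = c(0, 0, 0, 1, 1, 1, 1, 0, 1, 0)
)

# Points scored: 2 per made field goal, +1 extra per made three, 1 per free throw.
pts <- with(outliers, 2 * FGM + FG3M + FTM)

# Assumed weights (inferred from these rows only; blocks/turnovers omitted).
fd_check <- with(outliers, pts + 1.2 * rebounds + 1.5 * assists + 2 * steals)

# FD points per minute, the efficiency feature defined above.
outliers$eff <- outliers$fd / outliers$minutes

print(data.frame(fd = outliers$fd, fd_check = fd_check,
                 eff = round(outliers$eff, 2)))
```

Every row exceeds 3 FD points per minute on at most 3 minutes of play, which shows how one or two box-score events dominate the ratio when the denominator is this small.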

After filtering on minutes>5 and re-plotting, the positive trend between efficiency and FD points becomes clearer.

When building a predictive model, neither minutes played nor efficiency will be available before the game. As a result, these variables cannot be used in a model. To get a quick glimpse of the predictive power of various features, a correlation matrix is used.

The homeaway variable has the weakest correlation with FD points. As shown below, the mean FD score is only about 0.7 points higher for players at home.

##mean fd points home versus away
player_data[,.(mean_score=mean(fd)),by = homeaway]
## Source: local data table [2 x 2]
## 
## # tbl_dt [2 x 2]
##   homeaway mean_score
##      <dbl>      <dbl>
## 1        1   19.08364
## 2        0   18.41699

An additional feature is created to predict minutes played. Playing time can be impacted by injuries, fouls, blowout games or games that go to overtime. In order to predict such instances, more advanced models would be required. A simple way to account for a potential increase or decrease in playing time is to measure the depth (i.e. number of players at each position).

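##build position depth feature: number of other teammates at the same position in that game
team_depth = player_data[, .(pos_depth=.N - 1) ,by=.(gameID,team,position)]
##join on the grouping columns explicitly rather than relying on a natural join
player_data = player_data %>% inner_join(team_depth, by = c("gameID","team","position"))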

If a starting or backup point guard is injured, the depth at PG will decrease by 1 and signal a potential increase in playing time for the remaining point guards.
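The effect described here can be seen on a toy roster. The sketch below uses base R rather than the data.table code above, and the team, game and player labels are hypothetical. In game g1 both point guards appear; in game g2 the starter is missing.

```r
# Hypothetical box-score rows for one team across two games.
roster <- data.frame(
  gameID   = c("g1", "g1", "g1", "g1", "g2", "g2", "g2"),
  team     = "AAA",
  player   = c("pg_starter", "pg_backup", "sg_starter", "c_starter",
               "pg_backup", "sg_starter", "c_starter"),
  position = c("PG", "PG", "SG", "C", "PG", "SG", "C"),
  stringsAsFactors = FALSE
)

# Count rows per (gameID, team, position) group and subtract the player himself,
# mirroring pos_depth = .N - 1 in the report.
group_size <- ave(seq_len(nrow(roster)),
                  roster$gameID, roster$team, roster$position,
                  FUN = length)
roster$pos_depth <- group_size - 1

print(roster[roster$player == "pg_backup", c("gameID", "player", "pos_depth")])
```

For the backup point guard, pos_depth drops from 1 in g1 to 0 in g2, which is the signal of a possible increase in playing time.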

There appears to be a significant trend between positional depth and minutes played. As expected, greater positional depth leads to less playing time. There also appear to be some outliers at pos_depth>5. Since pos_depth counts a player’s teammates at the same position, a value above 5 implies 7 or more players at one position, which is highly unlikely.

##INSPECT pos_depth > 5
glimpse(player_data[pos_depth > 5,.(team,player)][order(player)]) ##-->duplicate player records
## Observations: 16
## Variables: 2
## $ team   <fctr> LAL, LAL, GS, GS, GS, GS, LAL, LAL, GS, GS, LAL, LAL, ...
## $ player <fctr> Brandon Bass, Brandon Bass, Brandon Rush, Brandon Rush...

Upon further inspection, these instances appear to be cases where there were duplicate player records in the data.

##get all dupes
glimpse(player_data[,.(count=.N),by=.(gameID,team,player)][count>1,.(gameID,player)])
## Observations: 109
## Variables: 2
## $ gameID <fctr> 20151206-dallas-mavericks-at-washington-wizards, 20151...
## $ player <fctr> Zaza Pachulia, Ryan Hollins, DeJuan Blair, Dirk Nowitz...

As shown above, a total of 109 duplicates are present. This isn’t expected to have a large effect on the data exploration due to the relatively small number of occurrences.
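The duplicates are left in place for this exploration. If they were to be removed, one possible approach is to keep the first record for each game, team and player combination. The sketch below uses base R on a hypothetical toy table, and it assumes duplicate rows carry identical statistics, which has not been verified for this dataset.

```r
# Hypothetical box-score rows; player_a is recorded twice in game g1.
box <- data.frame(
  gameID  = c("g1", "g1", "g1", "g2"),
  team    = c("AAA", "AAA", "AAA", "AAA"),
  player  = c("player_a", "player_a", "player_b", "player_a"),
  minutes = c(20, 20, 15, 25),
  stringsAsFactors = FALSE
)

# A row is a duplicate if an earlier row shares the same key columns.
key <- c("gameID", "team", "player")
is_dupe <- duplicated(box[, key])

# Keep only the first record per key.
box_dedup <- box[!is_dupe, ]
print(box_dedup)
```

With data.table, unique(player_data, by = c("gameID", "team", "player")) should give the same result. Features derived from row counts, such as pos_depth, would need to be rebuilt afterwards.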

Next, the correlation of different rolling minutes played is examined.

The average minutes played over the last 5 and 15 games appear to be the strongest predictors of minutes played on a nightly basis.

Next, we examine if certain positions tend to record more minutes than others.

## Source: local data table [5 x 2]
## 
## # tbl_dt [5 x 2]
##   position avg_min_played
##     <fctr>          <dbl>
## 1       PF       21.47340
## 2       PG       24.69206
## 3       SF       23.71057
## 4       SG       23.39125
## 5        C       21.91842

Guards & small forwards play more than centers & power forwards on average.

Similar to the rolling minutes variables, we examine which rolling FD points variable correlates most strongly with FD points on a nightly basis.

Again, we see that rolling variables over the last 5+ games have the highest correlation with nightly performance.

Next, we see which positions tend to record the most FD points.

## Source: local data table [5 x 2]
## 
## # tbl_dt [5 x 2]
##   position  mean_fd
##     <fctr>    <dbl>
## 1       SG 16.99232
## 2       SF 17.98178
## 3       PF 18.33892
## 4        C 20.02166
## 5       PG 20.62156

Point guards and centers tend to record the most FD points on a nightly basis. For the Fanduel site, lineups must have a fixed number of players at each position. However, other DFS sites allow for flex positions, so targeting point guards & centers in those spots could be an advantageous strategy.

The total Fanduel points scored by each team last season are shown below.

Golden State & Oklahoma City scored the most fantasy points per game. Targeting players on these teams is a possible strategy.

The total Fanduel points allowed by each team last season are shown below.

Sacramento & Charlotte had the most porous defenses from a fantasy standpoint. As a result, players competing against these teams could be targeted.

Stacking

“Stacking” is a popular concept in daily fantasy sports that involves putting players from the same team on your lineup. To explore whether stacking is a viable strategy in the NBA, it is necessary to examine the correlations between players and their teams.

First, we examine how fantasy points correlate at each position.

The cell with -0.38 represents the correlation between the total point guard fantasy points and total shooting guard fantasy points. In general, players on the same team are negatively correlated with each other. Intuitively this makes sense, as the shot attempts one player takes away from another outweigh the fantasy points generated by assists. The negative correlation is weaker between perimeter & post players (i.e. PG/C or SG/C).
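The code behind the positional correlation matrix is not shown, so the following is only a sketch of the mechanics: total the FD points by position within each team-game, then correlate the positional totals. The toy table and its column names are hypothetical, and the FD values are random numbers, so the resulting correlations carry no meaning.

```r
# Hypothetical toy data: 50 team-games, two players per position.
# Random numbers stand in for real FD points, so the correlations are noise.
set.seed(1)
toy <- expand.grid(team_game = 1:50,
                   position  = c("PG", "SG", "SF", "PF", "C"),
                   slot      = 1:2,
                   stringsAsFactors = FALSE)
toy$fd <- runif(nrow(toy), min = 0, max = 40)

# Total FD points per position for each team-game (rows = team-games).
pos_totals <- tapply(toy$fd, list(toy$team_game, toy$position), sum)

# Pairwise correlations between the positional totals.
pos_cor <- cor(pos_totals)
print(round(pos_cor, 2))
```

Each off-diagonal cell is read the same way as the -0.38 cell discussed above: it is the correlation, across team-games, between the totals of two positions.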

The correlation between opposing players at each position is weaker.

Next, correlation among the starting players is examined (as opposed to the bench + starting players from above).

starter_data = dcast(player_data[starter == 1,],gameID + team ~ position, fun.aggregate = mean, value.var = c('fd'),na.rm=T)
starter_data = as.data.table(starter_data)

Again, the same trend is shown, with perimeter players being negatively correlated with each other, as are post players.

Next, I examine the impact of high scoring games. First, the team scores (i.e. actual total points scored) are joined to the player data table.

player_data = player_data %>% inner_join(event_data[,.(gameID,home_score,away_score)], by = "gameID")
player_data[,team_score := home_score]
player_data[team == away_team, team_score := away_score]

player_data[, score_bucket := 1]
player_data[team_score > 75, score_bucket := 2]
player_data[team_score > 100, score_bucket := 3]
player_data[team_score > 125, score_bucket := 4]
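The same bucketing can be expressed in a single call with base R’s cut(). This is a sketch on a hypothetical vector of team scores, using the thresholds from the code above. With the default right-closed intervals, a score of exactly 75, 100 or 125 stays in the lower bucket, which matches the strict > comparisons.

```r
# Hypothetical team scores, including values exactly on the thresholds.
team_score <- c(70, 75, 76, 100, 101, 125, 126)

# Intervals are (-Inf,75], (75,100], (100,125], (125,Inf);
# labels = FALSE returns the integer bucket index 1..4.
score_bucket <- cut(team_score,
                    breaks = c(-Inf, 75, 100, 125, Inf),
                    labels = FALSE)

print(data.frame(team_score, score_bucket))
```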

The results are shown below.

As expected, high scoring games correlate directly with higher fantasy point production. Oddsmakers publish projected point totals for each NBA game, so players in games with high projected totals should be targeted.

Lastly, I examine the effect of weak rebounding teams and teams with lots of turnovers.

###opponent rebounds and turnovers
opp_stats = team_data[,.(gameID,team,rebounds,turnovers)]
setnames(opp_stats,old=c('team','rebounds','turnovers'),new=c('opponent','opp_rebounds','opp_turnovers'))

player_data = player_data %>% inner_join(opp_stats, by = c("gameID","opponent"))

When opposing teams turn the ball over more frequently, it does not necessarily lead to higher fantasy production.

Centers and power forwards are expected to benefit from weak rebounding teams due to more fantasy points from rebounds and second-chance baskets. The plot above shows a slightly negative trend (i.e. more rebounds by opponents = lower fantasy point production for PFs/Cs).


Multivariate Analysis

The formula for FD points was shown earlier. The plots below show how the distributions of relevant statistics vary between positions.

base_stats = player_data[,.(`3FGM` = sum(`3FGM`)/.N , FGM = sum(FGM)/.N, FTM = sum(FTM)/.N,
                          rebounds = sum(rebounds)/.N, assists = sum(assists)/.N,
                          blocks = sum(blocks)/.N, steals = sum(steals)/.N,
                          turnovers = sum(turnovers)/.N, fouls = sum(fouls)/.N),
                         by = .(gameID, position)]

d = melt(base_stats, id.vars = c("gameID", "position"))

ggplot(d,aes(x = value, y = ..count.., colour = position)) + facet_wrap(~variable,scales = "free_x") +
  geom_density() + theme_dlin() + scale_y_continuous(limits = c(0, 10000))

Several conclusions can be drawn from the plot above, all intuitive to someone who has watched the sport. Starting at the top left, centers make the fewest three pointers, followed by power forwards, as indicated by their positively skewed distributions. Field goals and free throws are similar across all positions. Point guards have the highest number of assists, whereas centers have the highest number of fouls.

As an extension of what was shown above, I plot the total FD points scored/allowed by each team over each season.

## Source: local data table [120 x 3]
## 
## # tbl_dt [120 x 3]
##    season_code   team mean_opp_fd
##          <dbl> <fctr>       <dbl>
## 1     20122013    MEM    176.2451
## 2     20122013     NY    180.7780
## 3     20122013    MIA    180.9512
## 4     20122013    IND    181.9099
## 5     20122013    LAC    182.2707
## 6     20122013    CHI    183.6402
## 7     20122013    BKN    185.7756
## 8     20122013    OKC    186.1476
## 9     20122013    UTA    191.1524
## 10    20122013    TOR    191.4073
## # ... with 110 more rows

## Source: local data table [120 x 3]
## 
## # tbl_dt [120 x 3]
##    season_code   team mean_team_fd
##          <dbl> <fctr>        <dbl>
## 1     20122013    CHA     183.8012
## 2     20122013     NO     185.2354
## 3     20122013    DET     186.4634
## 4     20122013    WAS     186.6134
## 5     20122013    POR     186.6537
## 6     20122013    CLE     187.1756
## 7     20122013    ORL     187.2061
## 8     20122013    PHI     188.5695
## 9     20122013    TOR     188.5963
## 10    20122013    BKN     188.9561
## # ... with 110 more rows

Final Plots and Summary

Reflection
